{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 决策树：\n",
    "是一种基本的分类与回归方法，此处主要讨论分类的决策树\n",
    "\n",
    "在分类问题中，表示基于特征对实例进行分类的过程，可以认为是if-then的集合，也可以认为是定义在特征空间与类空间上的条件概率分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 决策树通常有三个步骤：\n",
    "特征选择、决策树的生成、决策树的修剪。\n",
    "\n",
    "用决策树分类：从根节点开始，对实例的某一特征进行测试，根据测试结果将实例分配到其子节点，此时每个子节点对应着该特征的一个取值，如此递归的对实例进行测试并分配，直到到达叶节点，最后将实例分到叶节点的类中。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![title](pic/8.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 决策树的构造\n",
    "决策树学习的算法通常是一个递归地选择最优特征，并根据该特征对训练数据进行分割，使得各个子数据集有一个最好的分类的过程。这一过程对应着对特征空间的划分，也对应着决策树的构建。\n",
    "\n",
    "1） 开始：构建根节点，将所有训练数据都放在根节点，选择一个最优特征，按着这一特征将训练数据集分割成子集，使得各个子集有一个在当前条件下最好的分类。\n",
    "\n",
    "2） 如果这些子集已经能够被基本正确分类，那么构建叶节点，并将这些子集分到所对应的叶节点去。\n",
    "\n",
    "3）如果还有子集不能够被正确的分类，那么就对这些子集选择新的最优特征，继续对其进行分割，构建相应的节点，如果递归进行，直至所有训练数据子集被基本正确的分类，或者没有合适的特征为止。\n",
    "\n",
    "4）每个子集都被分到叶节点上，即都有了明确的类，这样就生成了一颗决策树。\n",
    "# 决策树的特点：\n",
    "\n",
    "优点：计算复杂度不高，输出结果易于理解，对中间值的缺失不敏感，可以处理不相关特征数据\n",
    "\n",
    "缺点：可能会产生过度匹配的问题\n",
    "\n",
    "适用数据类型：数值型和标称型\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "伪代码：\n",
    "\n",
    "If so return 类标签：\n",
    "\n",
    "Else\n",
    "\n",
    "     寻找划分数据集的最好特征\n",
    "    \n",
    "     划分数据集\n",
    "        \n",
    "     创建分支节点\n",
    "    \n",
    "         for 每个划分的子集\n",
    "        \n",
    "             调用函数createBranch()并增加返回结果到分支节点中\n",
    "            \n",
    "         return 分支节点\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 熵：不确定性\n",
    "为了计算熵，我们需要计算所有类别所有可能值所包含的信息期望值，通过下式得到：\n",
    "\n",
    "\n",
    "$$\n",
    "H=-\\Sigma_{i=1}^{n} p\\left(x_{i}\\right) \\log _{2} p\\left(x_{i}\\right)\n",
    "\\\\\n",
    "$$\n",
    "\n",
    "训练数据集D，则训练数据集D的经验熵为H(D)，|D|表示其样本容量，及样本个数。设有K个类Ck，k = 1,2,3,···,K，|Ck|为属于类Ck的样本个数，这经验熵公式可以写为：\n",
    "$$\n",
    "H(D)=-\\Sigma \\frac{\\left|c_{k}\\right|}{|D|} \\log _{2} \\frac{\\left|c_{k}\\right|}{|D|}\n",
    "\\\\\n",
    "$$\n",
    "\n",
    "比如，在15个人中，有9个贷款，6个不贷款 ，那么熵H（D）为：\n",
    "$$\n",
    "H(D)=-\\frac{9}{15} \\log _{2} \\frac{9}{15}-\\frac{6}{15} \\log _{2} \\frac{6}{15}=0.971\n",
    "$$\n",
    "![title](pic/9.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 信息增益 ：\n",
    "划分数据集的大原则是：将无序数据变得更加有序，但是各种方法都有各自的优缺点，信息论是量化处理信息的分支科学，在划分数据集前后信息发生的变化称为信息增益，获得信息增益最高的特征就是最好的选择，所以必须先学习如何计算信息增益，集合信息的度量方式称为香农熵，或者简称熵。\n",
    "# 条件熵：\n",
    "\n",
    "信息增益表示得知特征X的信息而使得类Y的信息不确定性减少的程度。\n",
    "\n",
    "条件熵H(Y∣X) H(Y|X)H(Y∣X)表示在已知随机变量X的条件下随机变量Y的不确定性，随机变量X给定的条件下随机变量Y的条件熵(conditional entropy) H(Y|X)，定义X给定条件下Y的条件概率分布的熵对X的数学期望：\n",
    "$$\n",
    "H(Y | X)=\\sum_{i=1}^{n} p_{i} H\\left(Y | X=x_{i}\\right)\\\\\n",
    "p_{i}=P\\left(X=x_{i}\\right)\n",
    "\\\\\n",
    "$$\n",
    "\n",
    "信息增益是相对于特征而言的。所以，特征A对训练数据集D的信息增益g(D,A)，定义为集合D的经验熵H(D)与特征A给定条件下D的经验条件熵H(D|A)之差，即：\n",
    "$$\n",
    "g(D, A)=H(D)-H(D | A)\n",
    "\\\\\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createDataSet():   #简单数据集做测试\n",
    "    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]\n",
    "    labels = ['no surfacing', 'flippers']\n",
    "    return dataSet,labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def creatDataSet2():\n",
    "    # 数据集\n",
    "    dataSet=[[0, 0, 0, 0, 'no'],\n",
    "            [0, 0, 0, 1, 'no'],\n",
    "            [0, 1, 0, 1, 'yes'],\n",
    "            [0, 1, 1, 0, 'yes'],\n",
    "            [0, 0, 0, 0, 'no'],\n",
    "            [1, 0, 0, 0, 'no'],\n",
    "            [1, 0, 0, 1, 'no'],\n",
    "            [1, 1, 1, 1, 'yes'],\n",
    "            [1, 0, 1, 2, 'yes'],\n",
    "            [1, 0, 1, 2, 'yes'],\n",
    "            [2, 0, 1, 2, 'yes'],\n",
    "            [2, 0, 1, 1, 'yes'],\n",
    "            [2, 1, 0, 1, 'yes'],\n",
    "            [2, 1, 0, 2, 'yes'],\n",
    "            [2, 0, 0, 0, 'no']]\n",
    "    #分类属性\n",
    "    labels=['年龄','有工作','有自己的房子','信贷情况']\n",
    "    #返回数据集和分类属性\n",
    "    return dataSet,labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def systemShannonEnt(dataset):   #计算传入数据的熵\n",
    "    numEntries = len(dataset)\n",
    "    labelCounts = {}\n",
    "    for featVec in dataset:\n",
    "        currentlabel = featVec[-1]\n",
    "        if currentlabel not in labelCounts.keys():\n",
    "            labelCounts[currentlabel] = 0\n",
    "        labelCounts[currentlabel] += 1\n",
    "#         labelCounts[currentlabel] =labelCounts.get(currentlabel,0)+1\n",
    "    ShannonEnt = 0.0\n",
    "    for key in labelCounts.keys():  #迭代key\n",
    "        prob = float(labelCounts[key] / numEntries)\n",
    "        ShannonEnt -= prob * np.log2(prob)    #np.log的参数已经改变，用log2（x），或者np.math.log(x,2)\n",
    "    return ShannonEnt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9709505944546686"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a,b=creatDataSet2()\n",
    "systemShannonEnt(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def splitDataSet(dataset,axis,value):  #传入的数据，划分依据的特征，特征的值\n",
    "    retDataSet=[]\n",
    "    for featVec in dataset:\n",
    "        if featVec[axis]==value:\n",
    "            reducedFeatVec=featVec[:axis]\n",
    "            reducedFeatVec.extend(featVec[axis+1:])\n",
    "            retDataSet.append(reducedFeatVec)\n",
    "    return  retDataSet   #返回的数据是不含axis的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[1, 'yes'], [1, 'yes'], [0, 'no']]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "splitDataSet(a,0,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chooseBestFeatureToSplit(dataSet):  #选出最佳分割特征\n",
    "    numFeatures=len(dataSet[0])-1\n",
    "    baseEntropy=systemShannonEnt(dataSet) #计算基础熵\n",
    "    bestInfoGain=0.0 ;bestFeature=-1\n",
    "    for i in range(numFeatures):\n",
    "        featList=[ex[i] for ex in  dataSet]   #第i列特征\n",
    "        uniqueVals=set(featList)              #特征取值集合\n",
    "        newEnt=0.0\n",
    "        for value in uniqueVals:             #计算熵\n",
    "            subDataSet=splitDataSet(dataSet,i,value)\n",
    "            prob=len(subDataSet)/float(len(dataSet))\n",
    "            newEnt+=prob*systemShannonEnt(subDataSet)\n",
    "        InfoGain=baseEntropy-newEnt       #熵增\n",
    "        if InfoGain>bestInfoGain:        #挑选熵增最大的特征\n",
    "            bestInfoGain=InfoGain\n",
    "            bestFeature=i\n",
    "    return bestFeature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chooseBestFeatureToSplit(splitDataSet(a,1,0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def majorityCnt(classList):\n",
    "    classCount={}\n",
    "    for vote in classList:\n",
    "        if vote not in classCount.keys() :\n",
    "            classCount[vote]=0\n",
    "        classCount[vote]+=1\n",
    "#         classCount[vote] = classCount.get(vote,0)+1\n",
    "    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)\n",
    "    return sortedClassCount[0][0]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createTree(dataSet,labels):  #构建决策树\n",
    "    classList=[ex[-1] for ex in dataSet]   #label\n",
    "    if classList.count(classList[0])==len(classList):  #如果只剩一个label，类别完全相同，停止划分\n",
    "        return classList[0]\n",
    "    if len(dataSet[0])==1:           #遍历完所有的特征，返回出现次数最多的\n",
    "        return majorityCnt(classList)\n",
    "    bestFeat=chooseBestFeatureToSplit(dataSet)\n",
    "    bestFeatLabel=labels[bestFeat]  #mappnig\n",
    "    myTree={bestFeatLabel:{}}\n",
    "    del (labels[bestFeat])\n",
    "    featValues=[ex[bestFeat] for ex in dataSet]\n",
    "    uniquevals=set(featValues)\n",
    "    for value in uniquevals:\n",
    "        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),labels)\n",
    "    return myTree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "createTree(a,b)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
